This project is a data analysis culminating in a series of Bayesian generalized linear models fit to data on masters program admissions in India. Three separate models are fit: a linear regression, an ordinal regression, and a beta regression.
In the picture above, we see the mathematical model of a simple linear regression. What makes this a Bayesian linear regression is that each of the parameters \(\beta_n\) and \(\sigma\) has its own prior distribution, and each of those prior distributions in turn has its own set of hyperparameters.
This analysis is unabashedly Bayesian. Consider Bayes’ formula:
\(P(A | B) = \frac{P(B | A)P(A)}{P(B)}\)
We can reinterpret this as the following:
\(P(\mathcal{H} | \mathcal{D}) = \frac{P(\mathcal{D} | \mathcal{H})P(\mathcal{H})}{P(\mathcal{D})}\)
where \(\mathcal{H}\) is a hypothesis and \(\mathcal{D}\) is the data which gives evidence for or against \(\mathcal{H}\).
In the equation above:
The prior \(P(\mathcal{H})\) is the probability that our hypothesis is true before we look at our data
The posterior \(P(\mathcal{H} | \mathcal{D})\) is the probability that \(\mathcal{H}\) is true after the data is considered
The likelihood \(P(\mathcal{D} | \mathcal{H})\) is the evidence about our hypothesis provided by the data
\(P(\mathcal{D})\) is the probability of the data taking into account all possible hypotheses
If we know our prior and likelihood, then we can compute our posterior exactly. This gives us a deductive logic of probability, and allows us to compare hypotheses, draw conclusions, and make decisions. In most cases, however, we do not know the prior probability of any given hypothesis. As recourse, we turn to statistical inference: we make up priors (Bayesian), or we rely only on the likelihood (Frequentist).
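As a brief numeric illustration (the numbers here are invented for exposition, not drawn from this analysis): suppose a hypothesis has prior probability \(P(\mathcal{H}) = 0.01\), the data have likelihood \(P(\mathcal{D} | \mathcal{H}) = 0.9\) under it, and likelihood \(0.1\) under its complement. Then the posterior is

```latex
P(\mathcal{H} \mid \mathcal{D})
  = \frac{P(\mathcal{D} \mid \mathcal{H})\, P(\mathcal{H})}{P(\mathcal{D})}
  = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.1 \times 0.99}
  = \frac{0.009}{0.108}
  \approx 0.083
```

Even strong evidence moves a very small prior only so far, which is one reason the choice of prior matters.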
Bayesian inference models uncertainty by a probability distribution over hypotheses. Our ability to make inferences depends on our degree of confidence in our chosen prior. In frequentist statistics we assume that some hypothesis is true and that the observed data is sampled from that distribution. Particularly, frequentist statistics do not depend on subjective priors.
For further contrast:
| Bayesian Inference | Frequentist Inference |
|---|---|
| Probability for both hypothesis and data | No prior or posterior |
| Depends on the prior and likelihood of observed data | Depends on the likelihood for both observed and unobserved data |
| Requires constructing a subjective prior | No prior |
| May be computationally expensive | Generally less computationally intensive |
In the past there was much discourse between statisticians, who tended to fall firmly into one camp or the other. These days there is still debate, but the focus is much more on pragmatism than on philosophical ideals about how statistical analysis should be performed.
Bayesian analysis consists of four steps:
Specify a joint distribution for the outcome and all the unknowns. This takes the form of a marginal distribution for each unknown multiplied by the likelihood for the outcome conditional on the unknowns.
Sample the posterior distribution using Markov Chain Monte Carlo techniques.
Evaluate the model fit on the data
Sample from the posterior predictive distribution of the outcome given chosen values of the predictors. This allows us to understand how manipulating a predictor affects the outcome.
This dataset is used to predict admission into a Masters degree program. It was sampled from Engineering students at an Indian university. The parameters are the following:
| Parameter | Range | Description |
|---|---|---|
| GRE Score | 0-340 | Score on the GRE exam |
| TOEFL Score | 0-120 | Score on the TOEFL exam |
| University Rating | 0-5 | Indian university rating |
| Statement of Purpose | 0-5 | Self-assessed SOP score |
| Letter of Recommendation | 0-5 | Self-assessed LOR score |
| Undergraduate GPA | 0-10 | Cumulative undergraduate GPA |
| Research Experience | 0 or 1 | 1 if the student engaged in research, 0 otherwise |
| Chance of Admit | \(x \in [0, 1]\) | Likelihood of admission |
The source of this data is the following:
A Comparison of Regression Models for Prediction of Graduate Admissions
Mohan S Acharya, Asfia Armaan, Aneeta S Antony
IEEE International Conference on Computational Intelligence in Data Science 2019
# read data
admissions <- read_csv(data_location,
# coerce data types
col_types = list(col_integer(), col_integer(), col_integer(), col_integer(), col_double(), col_double(), col_double(), col_factor(), col_double()))
# rename features
admissions %>%
rename("Student" = "Serial No.",
"GRE" = "GRE Score",
"TOEFL" = "TOEFL Score",
"Rating" = "University Rating",
"GPA" = "CGPA",
"Chance" = "Chance of Admit") -> admissions
admissions %>% head(50)
# check for missing values
admissions %>% md.pattern()
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## Student GRE TOEFL Rating SOP LOR GPA Research Chance
## 400 1 1 1 1 1 1 1 1 1 0
## 0 0 0 0 0 0 0 0 0 0
# grab colnames
admissions %>% select(-c("Student", "Research")) %>% colnames() -> adm_colnames
# make plotting function
plot_density <- function(variable){
admissions %>%
ggplot() +
geom_density(aes(x = !!sym(variable)), fill = color_scheme[1])
}
# get density plots for each variable
density_plots <- future_map(adm_colnames, ~plot_density(.x))
# make a special density plot for binary Research variable
admissions %>%
ggplot() +
geom_density(aes(x = Research, fill = Research), alpha = 0.5) +
scale_fill_manual(values = color_scheme) -> density_plots[[8]]
# plot
density_plots %>%
plot_grid(plotlist = ., ncol = 2)
# make plotting function
plot_points <- function(data, mapping, ...){
data %>%
ggplot(mapping = mapping) +
geom_point(fill = color_scheme[1], color = "black", pch = 21) +
geom_smooth(method = "gam", color = color_scheme[4]) +
scale_x_continuous(expand = expand_scale(mult = 0.3)) +
scale_y_continuous(expand = expand_scale(mult = 0.3))
}
# grab lower plots from ggpairs
ggpairs_lower <- function(g){
g$plots <- g$plots[-(1:g$nrow)]
g$yAxisLabels <- g$yAxisLabels[-1]
g$nrow <- g$nrow - 1
g$plots <- g$plots[-(seq(g$ncol, length(g$plots), by = g$ncol))]
g$xAxisLabels <- g$xAxisLabels[-g$ncol]
g$ncol <- g$ncol - 1
g
}
admissions %>%
select(-c("Student", "Research")) %>%
ggpairs(upper = NULL, diag = NULL,
lower = list(continuous = plot_points), progress = FALSE) %>%
ggpairs_lower()
# create color palette for corrplot
col_ramped <- colorRampPalette(color_scheme)
# select features to plot
admissions %>%
select(-c("Student", "Research")) %>%
cor() %>%
  corrplot(method = "shade", col = col_ramped(200))
We see that most of the predictor variables have relatively high correlation.
| Statistic | GRE | TOEFL | Rating | SOP | LOR | GPA | Research | Chance |
|---|---|---|---|---|---|---|---|---|
| Min | 290.0 | 92.0 | 1.000 | 1.0 | 1.000 | 6.800 | 0: 181 | 0.3400 |
| 1st Qu. | 308.0 | 103.0 | 2.000 | 2.5 | 3.000 | 8.170 | 1: 219 | 0.6400 |
| Median | 317.0 | 107.0 | 3.000 | 3.5 | 3.500 | 8.610 | NA | 0.7300 |
| Mean | 316.8 | 107.4 | 3.087 | 3.4 | 3.453 | 8.599 | NA | 0.7244 |
| 3rd Qu. | 325.0 | 112.0 | 4.000 | 4.0 | 4.000 | 9.062 | NA | 0.8300 |
| Max | 340.0 | 120.0 | 5.000 | 5.0 | 5.000 | 9.920 | NA | 0.9700 |

(Research is a factor, so its column shows level counts rather than quantiles.)
Let’s look at the GRE score. We will use a conditionally normal probability density function to model the GRE score:
\(\frac{1}{\sigma \sqrt{2 \pi}} e^{- \frac{1}{2} (\frac{y - \mu}{\sigma})^2}\)
where \(\mu = \alpha + x^T \beta\) is a linear predictor and \(\sigma\) is the standard deviation of the error in predicting the outcome \(y\).
More generally, we can use a linear predictor \(\eta = \alpha + x^T \beta\) which can be related to the mean of the outcome via a link function \(g\) which will serve as a map between the range of values on which the outcome is defined and the space upon which the linear predictor is defined. We will use link functions further in the analysis, but for now we can assume that \(g\) is simply the identity function.
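For example (anticipating the beta regression later in the analysis), an outcome defined on \((0, 1)\) can be tied to the unbounded linear predictor via the logit link:

```latex
g(\mu) = \log \frac{\mu}{1 - \mu}, \qquad
\mu = g^{-1}(\eta) = \frac{e^{\eta}}{1 + e^{\eta}}
```

The link maps the outcome's natural range onto the whole real line, where the linear predictor lives.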
In order to fit a full Bayesian model, we must specify prior distributions \(f(\alpha)\) and \(f(\beta)\) for the intercept and vector of regression coefficients. For this first model, I will use weakly informative priors:
\(\alpha \sim \mathrm{Normal}(0, 10)\)
\(\beta \sim \mathrm{Normal}(0, 5)\)
With independent prior distributions, our joint posterior for \(\alpha\) and \(\beta\) is proportional to the product of the priors and the \(n\) likelihood contributions:
\(f(\alpha, \beta | y, X) \propto f(\alpha) \cdot \prod\limits_{k = 1}^K f(\beta_k) \cdot \prod\limits_{i=1}^N f(y_i | \eta_i)\)
where \(X\) is the matrix of predictors and \(\eta\) is the linear predictor.
First I will fit a simple model which can be visualized. This model will simply predict a student's GRE score based on their GPA.
lin_mod_1 <- stan_glm(data = admissions, formula = GRE ~ GPA,
family = gaussian(link = "identity"),
chains = 4, cores = core_num, seed = 8888)
lin_mod_1
## stan_glm
## family: gaussian [identity]
## formula: GRE ~ GPA
## observations: 400
## predictors: 2
## ------
## Median MAD_SD
## (Intercept) 178.8 4.5
## GPA 16.0 0.5
##
## Auxiliary parameter(s):
## Median MAD_SD
## sigma 6.4 0.2
##
## Sample avg. posterior predictive distribution of y:
## Median MAD_SD
## mean_PPD 316.8 0.5
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
We will also want to see the uncertainty in the model. A Bayesian fit yields a posterior distribution over the parameters rather than a single point estimate, so each posterior draw defines its own regression line. One way to show uncertainty is to plot the estimated regression line for each draw from the posterior distribution.
model_draws <- as_tibble(lin_mod_1) %>% set_names(c("a", "b"))
p1 <- ggplot(admissions, aes(y = GRE, x = GPA)) +
geom_point(size = 1, color = color_scheme[1]) +
geom_abline(intercept = coef(lin_mod_1)[1],
slope = coef(lin_mod_1)[2],
color = color_scheme[4], size = 1) +
ggtitle("Linear Model")
p2 <- p1 +
geom_abline(data = model_draws, aes(intercept = a, slope = b),
color = color_scheme[5], size = 0.2, alpha = 0.2) +
geom_abline(intercept = coef(lin_mod_1)[1],
slope = coef(lin_mod_1)[2],
color = color_scheme[4], size = 1) +
ggtitle("Linear Model + Uncertainty")
plot_grid(plotlist = list(p1, p2))
lin_mod_2 <- stan_glm(data = admissions, formula = GRE ~ GPA + TOEFL + SOP + LOR,
family = gaussian(link = "identity"),
chains = 4, cores = core_num, seed = 8888)
lin_mod_2
## stan_glm
## family: gaussian [identity]
## formula: GRE ~ GPA + TOEFL + SOP + LOR
## observations: 400
## predictors: 5
## ------
## Median MAD_SD
## (Intercept) 145.4 5.9
## GPA 8.9 0.9
## TOEFL 0.9 0.1
## SOP -0.4 0.5
## LOR 0.0 0.5
##
## Auxiliary parameter(s):
## Median MAD_SD
## sigma 5.6 0.2
##
## Sample avg. posterior predictive distribution of y:
## Median MAD_SD
## mean_PPD 316.8 0.4
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
In the two models above I added features arbitrarily. To do feature selection with Bayesian models, I will be using the projpred package, which implements projection predictive variable selection for generalized linear models.
cvs <- cv_varsel(lin_mod_2, method = 'forward', cv_method = 'kfold', K = 10, seed = 8888, verbose = FALSE)
paste0("Suggested Number of Variables: ", suggest_size(cvs))
## [1] "Suggested Number of Variables: 2"
varsel_plot(cvs, stats = c('elpd', 'rmse'))
Our variable selection process suggests a model consisting of two variables: GPA and TOEFL. Now we can fit this reduced model.
lin_mod_3 <- stan_glm(data = admissions, formula = GRE ~ GPA + TOEFL,
family = gaussian(link = "identity"),
chains = 4, cores = core_num, seed = 8888)
lin_mod_3
## stan_glm
## family: gaussian [identity]
## formula: GRE ~ GPA + TOEFL
## observations: 400
## predictors: 3
## ------
## Median MAD_SD
## (Intercept) 148.3 5.1
## GPA 8.6 0.8
## TOEFL 0.9 0.1
##
## Auxiliary parameter(s):
## Median MAD_SD
## sigma 5.6 0.2
##
## Sample avg. posterior predictive distribution of y:
## Median MAD_SD
## mean_PPD 316.8 0.4
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
Now that we have two models, we can demonstrate how to compare our models. For this, we will use Leave One Out cross validation.
# CV model 1
loo_cv_1 <- loo(lin_mod_1, cores = core_num)
# CV model 2
loo_cv_2 <- loo(lin_mod_2, cores = core_num)
# CV model 3
loo_cv_3 <- loo(lin_mod_3, cores = core_num)
# compare our models
comparison <- compare_models(loo_cv_1, loo_cv_2, loo_cv_3)
comparison
##
## Model comparison:
## (ordered by highest ELPD)
##
## elpd_diff se_diff
## lin_mod_3 0.0 0.0
## lin_mod_2 -1.7 0.9
## lin_mod_1 -49.1 11.6
In this case, our third model is preferred, but only slightly over the second model. Thus, our best model is
GRE ~ GPA + TOEFL
elpd_diff is the difference in expected log predictive density (elpd) relative to the best model. The elpd is given by the expression

\(\mathrm{elpd} = \sum\limits_{i = 1}^n \int p_t(\tilde{y}_i) \log p(\tilde{y}_i | y) \, d \tilde{y}_i\)

where \(p_t(\tilde{y}_i)\) is the distribution of the true data generating process. Since the true data generating process is unknown, we approximate the elpd via leave one out cross validation:

\(\mathrm{elpd}_{loo} = \sum\limits_{i = 1}^n \log p(y_i | y_{-i})\)

where \(p(y_i | y_{-i}) = \int p(y_i | \theta) p(\theta | y_{-i}) \, d\theta\).

The elpd_diff for the first model is much larger in magnitude than its standard error, so the second and third models fit the data much better than the first.
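To make \(\mathrm{elpd}_{loo}\) concrete, here is a toy computation on synthetic data using brute-force refits (the loo package instead uses a PSIS approximation so the model need only be fit once); a maximum likelihood fit stands in for the posterior:

```r
# Toy brute-force elpd_loo: refit without observation i, then score
# the log predictive density of y[i] under a normal likelihood.
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
elpd_loo <- sum(vapply(seq_along(y), function(i) {
  fit <- lm(y[-i] ~ x[-i])                   # fit with observation i held out
  mu  <- coef(fit)[1] + coef(fit)[2] * x[i]  # predicted mean for the held-out point
  dnorm(y[i], mean = mu, sd = summary(fit)$sigma, log = TRUE)
}, numeric(1)))
elpd_loo  # higher (less negative) means better expected predictive density
```

Each term is a log density, so individual contributions are typically negative; models are compared by their sums, as in the table above.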
rstanarm comes with a handy tool for checking plots and diagnostic criteria:
# launch_shinystan(lin_mod_2)
We can also generate some of the information in the shinystan app in our notebook.
The pp_check function generates a variety of plots comparing the observed outcome \(y\) to the simulated datasets \(y^{rep}\) from the posterior predictive distribution using the same observations of the predictors \(X\) as were used to fit the model.
p1 <- pp_check(lin_mod_1, nreps = 5) +
ggtitle("Simulated Posteriors of GRE") +
scale_color_manual(values = c(color_scheme[1], color_scheme[2])) +
scale_fill_discrete(labels = c("y", "y rep")) +
ggtitle("Model 1", subtitle = "GRE ~ GPA")
p2 <- pp_check(lin_mod_2, nreps = 5) +
ggtitle("Simulated Posteriors of GRE") +
scale_color_manual(values = c(color_scheme[1], color_scheme[2])) +
scale_fill_discrete(labels = c("y", "y rep")) +
ggtitle("Model 2", subtitle = "GRE ~ GPA + TOEFL + SOP + LOR")
p3 <- pp_check(lin_mod_3, nreps = 5) +
ggtitle("Simulated Posteriors of GRE") +
scale_color_manual(values = c(color_scheme[1], color_scheme[2])) +
scale_fill_discrete(labels = c("y", "y rep")) +
ggtitle("Model 3", subtitle = "GRE ~ GPA + TOEFL")
p4 <- pp_check(lin_mod_3, plotfun = "stat_2d", stat = c("mean", "sd")) + scale_fill_manual(values = c(color_scheme[1], color_scheme[5])) +
xlim(c(315, 318)) +
ggtitle("Mean & Std Dev", subtitle = "GRE ~ GPA + TOEFL")
plot_grid(plotlist = list(p1, p2, p3, p4), ncol = 2)
Now that we have a predictive distribution, we can use it to generate some new data predictions. To do this, I will generate some data and then predict the GRE scores based on this data. We will be using the third linear model (GRE ~ GPA + TOEFL) because it performed best.
# generate data
test_data <- tibble(
"GPA" = seq(from = 5.5, to = 10, by = 0.5),
"TOEFL" = seq(from = 75, to = 120, by = 5)
)
# generate predictions
test_preds <- posterior_predict(lin_mod_3, newdata = test_data)
# grab 100 predictions
test_preds %>% as_tibble() %>% slice(1:100) -> out_preds
# change column names to Student_#
out_preds %<>% set_colnames(paste("Student_", colnames(.), sep = ""))
# place predictions in a nested dataframe
out_preds %>% gather() %>% nest(value) -> out_preds
out_preds %<>% rename("Student" = key)
# add predictions to the dataframe
test_data %<>% add_column("Predictions" = out_preds)
test_data
Now we have a table with both our simulated predictors (GPA, TOEFL), but also a nested table of our predictions:
test_data$Predictions
In our predictions, our students are broken down into 10 groups in order of increasing scores for each of the predictors.
# create group plotter
group_plotter <- function(data, group) {
  # get student number mod 6 (+1) to index into the color scheme
st_num <- group %>% str_extract("[0-9]+") %>% as.numeric()
st_num_mod <- st_num %% 6 + 1
# plot
data %>%
filter(.[1] == !!group) %>%
unnest(data) %>%
ggplot(aes(x = value)) +
geom_density(fill = color_scheme[st_num_mod], alpha = 0.25) +
ggtitle(label = paste("Student Group", st_num)) +
xlim(c(225, 375))
}
# grab student names
students <- test_data$Predictions$Student
# create plots
future_map(students, ~group_plotter(test_data$Predictions, .x)) %>%
plot_grid(plotlist = ., ncol = 1)
Ordinal regression is a type of regression analysis used to predict an ordinal variable: one whose values lie on an arbitrary scale where only the relative ordering is significant. In machine learning this is often called learning to rank. We can think of ordinal regression as a technique that sits between regression and classification.
For this use case, we will be predicting the University Rating, since it takes the ordered integer values \(y \in \{1, \dots, 5\}\).
Ordinal outcomes fall into one of \(J\) categories. In our case, \(J = 5\). We can imagine how this model works by introducing a latent variable, \(y^*\) that is related to the observed outcomes via an observation mechanism:
\(y = \begin{cases} 1 & y^* \leq \xi_1 \\ 2 & \xi_1 < y^* \leq \xi_2 \\ \vdots \\ J & \xi_{J-1} < y^* \end{cases}\)
where \(\xi\) is a vector of cutpoints of length \(J - 1\).
Then we can model \(y^*\) as a linear function of \(K\) predictors
\(y^* = \mu + \epsilon = x^T \beta + \epsilon\)
where \(\epsilon\) has mean zero and unit scale but can be specified as being drawn from one of several distributions. There is also no intercept in this model, since the data cannot distinguish an intercept from the cutpoints.
From these assumptions we can derive the conditional distribution as
\(\begin{split} P(y = k | x) &= P(\xi_{k-1} < y^* \leq \xi_k | x) \\ &= P(\xi_{k-1} < w \cdot x + \epsilon \leq \xi_k) \\ &= \Phi(\xi_k - w \cdot x) - \Phi(\xi_{k-1} - w \cdot x) \end{split}\)
where \(\Phi\) is the cumulative distribution function of the standard normal distribution.
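As a small worked example (hypothetical cutpoints and linear predictor, not from the fitted model below): take \(J = 3\), cutpoints \(\xi_1 = -1\), \(\xi_2 = 1\), and \(w \cdot x = 0.5\). Then

```latex
P(y = 2 \mid x) = \Phi(\xi_2 - w \cdot x) - \Phi(\xi_1 - w \cdot x)
                = \Phi(0.5) - \Phi(-1.5)
                \approx 0.691 - 0.067 = 0.624
```

Shifting \(w \cdot x\) upward slides probability mass toward the higher categories.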
The main difference between an ordinal outcome and a linear outcome is that the scale of \(y^*\) is not identified by the data. Therefore, when we consider \(\epsilon\), we specify \(\sigma_\epsilon = 1\). This in turn implies that \(\sigma_{y^*} = \frac{1}{\sqrt{1 - R^2}}\).
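The scale above follows from the variance decomposition of the latent variable: with \(\sigma_\epsilon = 1\) and \(R^2\) defined as the share of latent variance explained by the linear predictor,

```latex
\sigma_{y^*}^2 = \mathrm{Var}(x^T \beta) + \sigma_\epsilon^2
              = R^2 \sigma_{y^*}^2 + 1
\quad \Longrightarrow \quad
\sigma_{y^*}^2 = \frac{1}{1 - R^2},
\qquad
\sigma_{y^*} = \frac{1}{\sqrt{1 - R^2}}
```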
Another difference is that we don’t have a global intercept (\(\alpha\) in our linear regression), but instead a vector of \(J-1\) cutpoints. For these cutpoints we will specify a Dirichlet prior on \(P(y = j | \bar{x})\): the prior probability of the outcome falling in each of the \(J\) categories given that the predictors are at their sample means. The Dirichlet prior is for a simplex random variable, with non-negative elements that sum to 1. It can be written as

\(f(\pi | \alpha) \propto \prod\limits_{j=1}^{J} \pi_j^{\alpha_j - 1}\)

where \(\pi\) is a simplex vector such that \(\pi_j = P(y = j | \bar{x})\), and \(\alpha\) is a vector of concentration hyperparameters that we can interpret as prior counts. For example, if \(\alpha_j = 1\) for all \(j \in \{1, \dots, J\}\), then the Dirichlet prior is jointly uniform over the space of these simplexes. This is equivalent to saying that one observation falls into each of the \(J\) ordinal categories when the predictors are at their sample means.
We can then get the \(j\)th cutpoint via

\(\xi_j = F_{y^*}^{-1} \left( \sum\limits_{k=1}^{j} \pi_k \right)\)

where \(F_{y^*}^{-1}\) is the inverse CDF of the assumed distribution of \(y^*\) (which we defined earlier as normal, but could just as easily be logistic). The scale parameter for \(\xi_j\) is again \(\sigma_{y^*} = \frac{1}{\sqrt{1 - R^2}}\).
In order to get good results with ordinal regression, it helps to scale the values.
# coerce Rating to factor and scale variables
admissions_scaled <- admissions %>%
mutate(Rating = Rating %>% as_factor(),
GRE = GRE %>% scale(),
TOEFL = TOEFL %>% scale(),
SOP = SOP %>% scale(),
LOR = LOR %>% scale(),
GPA = GPA %>% scale(),
Chance = Chance %>% scale())
# look at scaled data
admissions_scaled %>% head()
Rating ~ SOP + LOR + Research
# fit model
ord_mod_1 <- stan_polr(Rating ~ SOP + LOR + Research, data = admissions_scaled,
prior = R2(0.005), prior_counts = dirichlet(90),
chains = 12, cores = core_num, seed = 8888)
ord_mod_1
## stan_polr
## family: ordered [logistic]
## formula: Rating ~ SOP + LOR + Research
## observations: 400
## ------
## Median MAD_SD
## SOP 0.7 0.1
## LOR 0.3 0.1
## Research0 -0.3 0.2
##
## Cutpoints:
## Median MAD_SD
## 1|2 -2.7 0.2
## 2|3 -0.9 0.1
## 3|4 0.6 0.1
## 4|5 1.9 0.1
##
## Sample avg. posterior predictive distribution of y:
## Median MAD_SD
## mean_PPD:1 0.1 0.0
## mean_PPD:2 0.2 0.0
## mean_PPD:3 0.3 0.0
## mean_PPD:4 0.2 0.0
## mean_PPD:5 0.1 0.0
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
As an interpretation, suppose a student has the scaled values SOP = 0.09, LOR = 0.05, and Research = 1. Using the posterior medians above (SOP 0.7, LOR 0.3, and Research0 -0.3, which contributes only when Research = 0), the linear predictor is

0.7 * 0.09 + 0.3 * 0.05 = 0.078

which falls between the 2|3 cutpoint (-0.9) and the 3|4 cutpoint (0.6), for a predicted university rating of 3.
In order to compare this with a different model, I will fit a second model. This second model will predict the University ranking from all the available numeric predictors.
Rating ~ GRE + TOEFL + SOP + LOR + GPA + Chance + Research
# fit model 2
ord_mod_2 <- stan_polr(Rating ~ GRE + TOEFL + SOP + LOR + GPA + Chance + Research,
                       data = admissions_scaled,
                       prior = R2(0.005), prior_counts = dirichlet(90),
                       chains = 12, cores = core_num, seed = 8888)
ord_mod_2
## stan_polr
## family: ordered [logistic]
## formula: Rating ~ GRE + TOEFL + SOP + LOR + GPA + Chance + Research
## observations: 400
## ------
## Median MAD_SD
## GRE 0.0 0.1
## TOEFL 0.0 0.1
## SOP 0.1 0.1
## LOR 0.0 0.0
## GPA 0.0 0.1
## Chance 0.0 0.1
## Research0 0.0 0.1
##
## Cutpoints:
## Median MAD_SD
## 1|2 -1.9 0.1
## 2|3 -0.6 0.1
## 3|4 0.5 0.1
## 4|5 1.6 0.1
##
## Sample avg. posterior predictive distribution of y:
## Median MAD_SD
## mean_PPD:1 0.1 0.0
## mean_PPD:2 0.2 0.0
## mean_PPD:3 0.3 0.0
## mean_PPD:4 0.2 0.0
## mean_PPD:5 0.2 0.0
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
p1 <- plot(ord_mod_1) +
ggtitle("Ordinal Model 1")
p2 <- plot(ord_mod_2) +
ggtitle("Ordinal Model 2")
plot_grid(plotlist = list(p1, p2), ncol = 2)
# leave one out CV
loo_cv_1 <- loo(ord_mod_1, cores = core_num)
loo_cv_2 <- loo(ord_mod_2, cores = core_num)
# compare models
(comparison <- compare_models(loo_cv_1, loo_cv_2))
##
## Model comparison:
## (negative 'elpd_diff' favors 1st model, positive favors 2nd)
##
## elpd_diff se
## -102.5 5.2
In this case, our first model is preferred strongly over our second model.
p1 <- pp_check(ord_mod_1, nreps = 50) +
ggtitle("Ordinal Model 1") +
scale_color_manual(values = c(color_scheme[4], color_scheme[1])) +
scale_fill_discrete(labels = c("y", "y rep")) +
theme(legend.position = "null")
p2 <- pp_check(ord_mod_2, nreps = 50) +
ggtitle("Ordinal Model 2") +
scale_color_manual(values = c(color_scheme[4], color_scheme[1])) +
scale_fill_discrete(labels = c("y", "y rep")) +
theme(legend.position = "top")
plot_grid(plotlist = list(p1, p2))
Beta regression is a form of regression used when we wish to model an outcome \(y \in (0, 1)\), such as a proportion or rate.
Beta regression uses the beta distribution as its likelihood function:

\(f(y; a, b) = \frac{y^{a - 1} (1 - y)^{b - 1}}{\mathrm{B}(a, b)}\)

where \(\mathrm{B}\) is the beta function:

\(\mathrm{B}(a, b) = \frac{\Gamma(a) \Gamma(b)}{\Gamma(a + b)}\)
The shape parameters for this distribution are \((a, b)\), and they enter into this model with the following transformations:
\(a = \mu \cdot \phi\)
\(b = (1 - \mu) \cdot \phi\)

Then, regarding \(\mu\):
Let \(g_1(\cdot)\) be some link function. In the specification of the shape parameters above, \(\mu = g_1^{-1}(X\beta)\) where \(X\) is an \(N \times K\) dimensional matrix of predictors, and \(\beta\) is a \(K\) dimensional vector of parameters associated with each predictor. In the simplest case, \(\phi\) is a scalar parameter. We can also model \(\phi\) using a second set of independent variables \(Z\). In this case we can let \(g_2(\cdot)\) be some link function that is not necessarily identical to \(g_1(\cdot)\). Then \(\phi = g_2^{-1}(Z\gamma)\), where \(\gamma\) is a \(J\) dimensional vector of parameters associated with the \(N \times J\) dimensional matrix of predictors \(Z\).
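To build intuition for how \(\mu\) and \(\phi\) control the distribution (illustrative numbers only, not from the fitted model):

```latex
\mu = 0.7,\ \phi = 20
\;\Longrightarrow\;
a = \mu \phi = 14, \quad b = (1 - \mu) \phi = 6, \quad
\mathrm{E}(y) = \frac{a}{a + b} = 0.7, \quad
\mathrm{Var}(y) = \frac{\mu (1 - \mu)}{1 + \phi} = 0.01
```

So \(\mu\) sets the mean directly, while \(\phi\) acts as a precision: larger \(\phi\) shrinks the variance.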
After substituting the shape parameters in, the likelihood function for the beta regression takes the form:

\(f(y; \mu, \phi) = \frac{y^{\mu \phi - 1} (1 - y)^{(1 - \mu) \phi - 1}}{\mathrm{B}(\mu \phi, (1 - \mu) \phi)}\)
For a full Bayesian analysis, we need to specify priors \(f(\beta_1)\) and \(f(\phi)\) for the vector of regression coefficients and for \(\phi\). These can be set in our stan model with the prior_intercept, prior, and prior_phi arguments.
When we model \(\phi\) with a linear predictor a full Bayesian analysis requires also specifying \(f(\gamma)\) and \(f(\beta_2)\). These can be set in stan with the prior_intercept_z and prior_z arguments.
Let us suppose that our coefficients are just as likely to be positive as they are to be negative, but are unlikely to be far from 0. These beliefs can be represented with normal priors:

\(\beta_1 \sim \mathrm{Normal}(0, 2.5)\)

\(\phi \sim \mathrm{Normal}(0, 2.5)\)
For a single set of independent variables, we have the following posterior:

\(f(\beta, \phi | y, X) \propto \prod\limits_{i=1}^{N} f(y_i | \mu_i, \phi) \cdot \prod\limits_{k=1}^{K} f(\beta_k) \cdot f(\phi)\)

where the posterior of \(\beta\) and \(\phi\) is proportional to the product of the \(N\) likelihood contributions, the \(K\) priors on the \(\beta_k\) parameters, and the prior on \(\phi\).
With two sets of independent variables, we have the following posterior:

\(f(\beta, \gamma | y, X, Z) \propto \prod\limits_{i=1}^{N} f(y_i | \mu_i, \phi_i) \cdot \prod\limits_{k=1}^{K} f(\beta_k) \cdot \prod\limits_{j=1}^{J} f(\gamma_j)\)

where the posterior of \(\beta\) and \(\gamma\) is proportional to the product of the likelihood contributions, the \(K\) priors on the \(\beta_k\) parameters, and the \(J\) priors on the \(\gamma_j\) parameters.
# coerce Rating to factor and scale variables
admissions_beta <- admissions %>%
mutate(Rating = Rating %>% as_factor(),
GRE = GRE %>% scale(),
TOEFL = TOEFL %>% scale(),
SOP = SOP %>% scale(),
LOR = LOR %>% scale(),
GPA = GPA %>% scale())
admissions_beta %>% head(50)
Chance ~ GRE + GPA
beta_mod_1 <- stan_betareg(Chance ~ GRE + GPA, data = admissions_beta, link = "logit",
                           prior_intercept = normal(0, 2.5), prior = normal(0, 20),
                           prior_phi = normal(0, 2.5),
                           chains = 12, cores = core_num, seed = 8888)
beta_mod_1
## stan_betareg
## family: beta [logit, link.phi=identity]
## formula: Chance ~ GRE + GPA
## observations: 400
## ------
## Median MAD_SD
## (Intercept) 1.0 0.0
## GRE 0.2 0.0
## GPA 0.5 0.0
## (phi) 23.3 1.4
##
## Sample avg. posterior predictive distribution of y:
## Median MAD_SD
## mean_PPD 0.7 0.0
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
As an interpretation of this model, suppose we take the predictor values from the first row of admissions_beta:
\(\mathrm{GRE} = 1.75 \quad \mathrm{|} \quad \mathrm{GPA} = 1.76\)
Then we need to transform our scale. Given that we have a logit link:

\(\mathrm{logit}(\mathrm{E}(y)) = \alpha + x^T \beta\)

The beta coefficients are the additional increase or decrease in the log odds of our response. To interpret them on the original scale, we invert the link:

\(\mathrm{E}(y) = \frac{e^{\alpha + x^T \beta}}{1 + e^{\alpha + x^T \beta}}\)

Then, plugging in the fitted medians,

\(\mathrm{logit}(\mathrm{E}(y)) = 1.03 + 0.2 * GRE + 0.48 * GPA\)

\(\mathrm{E}(y) = \frac{e^{1.03 + 0.2 * 1.75 + 0.48 * 1.76}}{1 + e^{1.03 + 0.2 * 1.75 + 0.48 * 1.76}} \approx 0.902\)

which is very close to our actual value of 92%.
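The back-transformation above can be sanity checked in a couple of lines of R (plogis is base R's inverse logit):

```r
# linear predictor from the fitted medians for the first row of predictors
eta <- 1.03 + 0.2 * 1.75 + 0.48 * 1.76
plogis(eta)  # inverse logit; approximately 0.902
```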
Chance ~ GRE + GPA + TOEFL + SOP + LOR
beta_mod_2 <- stan_betareg(Chance ~ GRE + GPA + TOEFL + SOP + LOR,
                           data = admissions_beta, link = "logit",
                           prior_intercept = normal(0, 2.5), prior = normal(0, 0.25),
                           prior_phi = normal(0, 20),
                           chains = 12, cores = core_num, seed = 8888)
beta_mod_2
## stan_betareg
## family: beta [logit, link.phi=identity]
## formula: Chance ~ GRE + GPA + TOEFL + SOP + LOR
## observations: 400
## ------
## Median MAD_SD
## (Intercept) 1.1 0.0
## GRE 0.1 0.0
## GPA 0.4 0.0
## TOEFL 0.1 0.0
## SOP 0.0 0.0
## LOR 0.1 0.0
## (phi) 42.7 2.9
##
## Sample avg. posterior predictive distribution of y:
## Median MAD_SD
## mean_PPD 0.7 0.0
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
Then, interpreting this new model as we did above, but with the second row of the admissions_beta data:

\(\mathrm{GRE} = 0.62 \quad \mathrm{|} \quad \mathrm{TOEFL} = -0.06 \quad \mathrm{|} \quad \mathrm{SOP} = 0.59 \quad \mathrm{|} \quad \mathrm{LOR} = 1.16 \quad \mathrm{|} \quad \mathrm{GPA} = 0.45\)

plugging these values and the fitted medians into the inverse logit gives us 0.79, which is reasonably close to our actual value of 76%.
loo_cv_1 <- loo(beta_mod_1, cores = core_num)
loo_cv_2 <- loo(beta_mod_2, cores = core_num)
# compare
(comparison <- compare_models(loo_cv_1, loo_cv_2))
##
## Model comparison:
## (negative 'elpd_diff' favors 1st model, positive favors 2nd)
##
## elpd_diff se
## 38.0 8.9
Our CV indicates that our second model fits best.
p1 <- pp_check(beta_mod_1, nreps = 50) +
ggtitle("Beta Regression 1") +
scale_color_manual(values = c(color_scheme[4], color_scheme[1])) +
scale_fill_discrete(labels = c("y", "y rep")) +
theme(legend.position = "null")
p2 <- pp_check(beta_mod_2, nreps = 50) +
ggtitle("Beta Regression 2") +
scale_color_manual(values = c(color_scheme[4], color_scheme[1])) +
scale_fill_discrete(labels = c("y", "y rep")) +
theme(legend.position = "top")
plot_grid(plotlist = list(p1, p2))